Deep Learning for Human Sensing

* Requirements for success (from more to less critical):

Data: A lot of real-world data (and algorithms that learn from data)
Semi-supervised: Human annotations of representative subsets of data
Efficient annotation: Specialized annotation tooling
Hardware: Large-scale distributed compute and storage
Robustness: Algorithms that don’t need calibration (learn the calibration)
Temporal dynamics: Algorithms that consider time

* Current importance relation for successful application of deep learning:
Good Algorithms*

* As long as they learn from data
Overview

* Human Imperfections
* Pedestrian Detection
* Body Pose Estimation
* Glance Classification
* Emotion Recognition
* Cognitive Load Estimation

* Human-Centered Vision for Autonomous Vehicles
Humans Are Amazing

[Video caption: Kacy Catanzaro, completed qualifying course]
Humans Are Amazing

* 3.22 trillion miles (US, 2016)
* 40,200 fatalities (US, 2016)
* 1 fatality per 80 million miles
* 1 in 625 chance of dying in a car crash (in your lifetime)
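The per-mile figure follows from the two totals on this slide. A quick check, with variable names chosen here for illustration only:

```python
# Totals quoted on the slide (US, 2016).
miles_driven = 3.22e12
fatalities = 40_200

# Average distance driven per traffic fatality.
miles_per_fatality = miles_driven / fatalities
print(f"{miles_per_fatality / 1e6:.1f} million miles per fatality")
```

The result rounds to the 80 million miles quoted above.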


Humans are Flawed


What is distracted driving? 


Texting 

Using a smartphone 
Eating and drinking 
Talking to passengers 
Grooming 

Reading, including maps 
Using a navigation system 
Watching a video 
Adjusting a radio 
* Injuries and fatalities:
3,179 people were killed and 431,000 were injured in motor vehicle crashes involving distracted drivers (in 2014)

* Texts:
169.3 billion text messages were sent in the US every month (as of December 2014)

* Eyes off road:
5 seconds is the average time your eyes are off the road while texting. When traveling at 55 mph, that's enough time to cover the length of a football field blindfolded.
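The football-field comparison can be checked with a unit conversion. This is a small sketch; the 360 ft field length (including both end zones) is a fact brought in here, not a number from the slide:

```python
# 1 mile = 5280 ft, 1 hour = 3600 s.
MPH_TO_FT_PER_S = 5280 / 3600

speed_mph = 55
eyes_off_road_s = 5
distance_ft = speed_mph * MPH_TO_FT_PER_S * eyes_off_road_s

# An American football field is 360 ft long including both end zones.
FIELD_LENGTH_FT = 360
print(f"{distance_ft:.0f} ft travelled vs. a {FIELD_LENGTH_FT} ft field")
```

At 55 mph the car covers slightly more than a full field in those 5 seconds.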
Humans are Flawed





Drunk Driving: In 2014, 31 percent of traffic fatalities involved a drunk driver. 


Drugged Driving: 23% of night-time drivers tested positive for illegal, prescription or 
over-the-counter medications. 


Distracted Driving: In 2014, 3,179 people (10 percent of overall traffic fatalities) were 
killed in crashes involving distracted drivers. 


Drowsy Driving: In 2014, nearly three percent of all traffic fatalities involved a drowsy 
driver, and at least 846 people were killed in crashes involving a drowsy driver. 
Two Paths to an Autonomous Future

A1: Human-Centered Autonomy

* Localization and Mapping:
Where am I?

* Scene Understanding:
Where/who/what/why of everyone else?

* Movement Planning:
How do I get from A to B?

* Human-Robot Interaction:
What is the physical and mental state of the driver?

* Communicate:
How do I convey intent to the driver and to the world?
A2: Full Autonomy

Blue Text: Easier
Red Text: Harder

* Localization and Mapping:
Where am I?

* Scene Understanding:
Where/who/what/why of everyone else?

* Movement Planning:
How do I get from A to B?

* Human-Robot Interaction:
What is the physical and mental state of the driver?

* Communicate:
How do I convey intent to the driver and to the world?
Is partially automated driving a bad idea? Observations from an on-road study

Article, April 2018
DOI: 10.1016/j.apergo.2017.11.010

Victoria Banks, University of Southampton
Alexander Eriksson, Swedish National Road and Transport Research Inst...
Neville A Stanton
Jim O'donoghue, University of Southampton
Public Perception of What Drivers Do in Semi-Autonomous Vehicles
MIT-AVT Naturalistic Driving Dataset

Large-Scale Naturalistic Data
Distributed Computing
Processed Trip Data
Processed Epoch Data
Visualizations

Human Behavior and Shared Autonomy: Understand Behavior, Assist Behavior, Share Control

MIT 6.S094: Deep Learning for Self-Driving Cars, Lex Fridman, January 2018
https://selfdrivingcars.mit.edu  lex.mit.edu


MIT-AVT Naturalistic Driving Dataset 


MIT Autonomous Vehicle 
Technology Study 


Study months to-date: 21 
Participant days: 7,146 
Drivers: 78 

Vehicles: 25 

Miles driven: 275,589 

Video frames: 3.48 billion 
Study data collection is ongoing. 
Statistics updated on: Oct 23, 2017. 


Tesla Model S: 14,117 miles, 248 days in study
Tesla Model X: 10,271 miles, 366 days in study
Tesla Model S: 5,186 miles, 91 days in study
Tesla Model X: 3,719 miles, 133 days in study
Tesla Model S: 24,657 miles, 588 days in study
Tesla Model S: 18,666 miles, 353 days in study
Tesla Model X: 15,074 miles, 276 days in study
Volvo S90: 13,970 miles, 325 days in study
Tesla Model S: 9,188 miles, 183 days in study
Tesla Model X: 5,111 miles, 232 days in study
Tesla Model S: 3,006 miles, 144 days in study
Tesla Model X: 22,001 miles, 421 days in study
Range Rover Evoque: 18,130 miles, 483 days in study
Range Rover Evoque: 14,499 miles, 440 days in study
Tesla Model S: 12,353 miles, 321 days in study
Tesla Model S: 8,319 miles, 374 days in study
Tesla Model S: 4,596 miles, 132 days in study
Tesla Model X: 1,306 miles, 69 days in study
Tesla Model S: 18,896 miles, 435 days in study
Tesla Model S: 15,735 miles, 322 days in study
Tesla Model S: 14,410 miles, 371 days in study
Volvo S90: 11,072 miles, 412 days in study
Tesla Model S: 6,720 miles, 194 days in study
Tesla Model X: 4,587 miles, 233 days in study
Tesla Model S: (Offload pending)
500+ Miles / Day and Growing

[Plot: Cumulative Distance Traveled (miles) vs. Study Duration (days), Mar 2016 to Aug 2017. Annotated collection rates, in order: 100-200 miles/day, 400-500 miles/day, 600-700 miles/day, 800-900 miles/day, 1000+ miles/day.]

Data collection is ongoing.
Tesla Autopilot: Patterns of Use

[Chart: Autopilot vs. Manual]

33.8% of the miles driven are with Autopilot engaged
Physical Engagement:
Glance Classification

[Bar chart: Probability of Region (Confidently Classified) by Glance Region, Manual vs. Autopilot]

Road: -2.69%
IC: 0.49%
Left: 0.78%
Rearview: 0.11%
CS: 0.7%
Right: 0.62%
Semi-Autonomous Driving:
Observed Patterns of Behavior

* The "how" of successful human-robot interaction:
Use but Don't Trust.

* The "why" of successful human-robot interaction:
Learn Limitations by Exploring.
Human Sensing:
A Deep Learning Perspective

Increasing level of detection resolution and

* Pedestrian Detection
* Body Pose
* Head Pose
* Blink Rate
* Blink Duration
* Eye Pose
* Blink Dynamics
* Pupil Diameter
* Micro Saccades

* Face Detection
* Face Classification
* Glance Classification
* Drowsiness
* Micro Glances
* Cognitive Load
Pedestrian Detection

* The usual challenges, e.g.:
  * Various styles of clothing in appearance
  * Different possible articulations
  * The presence of occluding accessories
  * Frequent occlusion between pedestrians

* History of object detection
  * Sliding window
    * Haar Cascades
    * Histogram of Oriented Gradients
  * CNN
    * R-CNN, Fast R-CNN, Faster R-CNN
    * Mask R-CNN (adds segmentation)

* VoxelNet (detection in 3D space)
R-CNN: Regions with CNN Features

* Simple algorithm
  * Extract region proposals (selective search)
  * Use CNN on each one (w/ non-maximum suppression)
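Non-maximum suppression, mentioned on this slide, prunes overlapping detections so that each object keeps only its highest-scoring box. Below is a minimal pure-Python sketch of the usual greedy variant. The box format `(x1, y1, x2, y2)`, the 0.5 IoU threshold, and the example boxes are assumptions for illustration, not values from the lecture:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_maximum_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical detections: two overlapping boxes on one pedestrian, one apart.
boxes = [(10, 10, 50, 90), (12, 12, 52, 92), (100, 20, 140, 100)]
scores = [0.9, 0.8, 0.7]
kept = non_maximum_suppression(boxes, scores)
print(kept)
```

The second box overlaps the first heavily and has a lower score, so only the first and third boxes survive.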
For the full updated list of references visit: https://selfdrivingcars.mit.edu/references


* Per 10 hours (1 recording day) 
* 12,000 pedestrians 
* 21,600,000 samples of feature vector 
Naturalistic Driving Data:
Pedestrians, Cyclists, Other Cars

Sensors: Sony FDR-AX53, ZED Stereo Camera, Gear 360 Camera, GoPro Hero4, Velodyne VLP-16, Velodyne HDL-64E
* Pattern of body movement
  * Vertical position in seat
  * General movement

* Beyond body movement
  * Smartphone
  * Hands on wheel
  * Activity
  * Context for DeepGlance
Glance Region: Right (Confidence: 92%)
Sequential Detection Approach

Sequential Upper Body Pose Estimation:
RGB -> Sequential detection -> Confidences

Temporal Fusion of Localized Confidences:
Charles, James, et al. "Upper body pose estimation with temporal sequential forests." Proceedings of the British Machine Vision Conference 2014. BMVA Press, 2014.
DeepPose: Holistic View

* Why holistic reasoning?

* Besides extreme variability in articulations, many of the joints are barely visible
Cascade of Pose Regressors









DNN-based refiner 
Part Detection

(a) Input image (b) Confidence maps

Elb(r)-Wri(r); Head-Neck

Assemble Parts: Part Affinity Fields

(b) Confidence maps (c) PAFs

Bipartite Matching
Temporal Convolutional Neural Networks

[Diagram labels: Input, SpatialNet, Spatial Fusion, Pose heatmaps, Optical flow, Warped heatmaps, Temporal Pooler, Pooled heatmap for frame t, Output, Loss 1, Loss 2. SpatialNet layers shown include conv1 (5x5x128, pool 2x2) and conv2 (5x5x128, pool 2x2).]

Pfister, Tomas, James Charles, and Andrew Zisserman. "Flowing convnets for human pose estimation in videos." Proceedings of the IEEE International Conference on Computer Vision. 2015.
Body Pose: 20 Epochs (30 minutes each)

Pose Estimation
(Outside Vehicle Perspective)
MIT Pedestrian Dataset

[Plot: Estimated Pedestrian Glance and Vehicle Speed. Legend: vehicle speed, glances, current frame, enter lane.]
Glance Classification vs Gaze Estimation

[Video frames with classifier overlay. Each panel shows the predicted region (Road) together with Accuracy, Frames, Time, Total Confident Decisions, and Correct Confident Decisions.]
Pedestrian Glance Classification

Driver State Detection

* Challenge: real-world data is "messy", have to deal with:
  * Vibration
  * Lighting variation
  * Body, head, eye movement

* Solution:
  * Automated calibration
  * Video stabilization (multi-resolutional)
  * Face part frontalization

* Use deep neural networks (DNN)
  * No feature engineering
  * Use raw data
Preprocessing
Source Video -> Automated Calibration, Video Stabilization, Face Frontalization, Motion Magnification

Raw Features
Facial Actions, Blink State, Pupil Area, Pupil Position, Raw Face Image, Raw Eye Image

Driver State Detection DNN Models
Face Alignment

* Landmarker.io
* Imperial College London

* Face in the Wild Challenge
  * XM2VTS
  * FRGC Ver.2
  * LFPW
  * HELEN
  * AFW
  * IBUG

* New Datasets
  * MPIIGaze
  * Columbia Gaze
  * 300VW


Gaze Classification Pipeline

1. Face detection (the only easy step)
2. Face alignment (active appearance models or deep nets)
3. Eye/pupil detection (are the eyes visible?)
4. Head (and eye) pose estimation (+ normalization)
5. Classification (supervised learning = improves from data)
6. Decision pruning (how confident is the prediction)
Annotation Tooling

“Semi-automated”: Ask a human for help with annotation when the machine is not confident.

Example cases: partial light occlusion, full light occlusion, move out of frame, hand occlusion
Semi-Automated Annotation Work Flow

* Human in red and machine in blue

1. Select and load in video of driver face.
2. Detect face: have we seen this person before?
3. Localize camera: have we seen this angle before?
4. Provide tradeoff between accuracy and percent frames.
5. Select target accuracy: 95%, 99%, or 99.9%
6. Perform gaze classification on full video (1 hour per 1 hour of video)
7. Step through and annotate the frames machine did not classify.
8. (Optional) Re-run steps 6 and 7.
9. Enjoy fully annotated video!
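The tradeoff between accuracy and percent of frames in this workflow comes from a confidence threshold: the machine labels only the frames it is confident about and leaves the rest for the human. A minimal sketch of that tradeoff follows. The function name, the `(confidence, is_correct)` input format, and the sample values are hypothetical, not taken from the lecture's tooling:

```python
def accuracy_coverage(predictions, threshold):
    """Accuracy on confident frames and fraction of frames labeled by machine.

    predictions: list of (confidence, is_correct) pairs, one per frame.
    Frames below the threshold are left for a human annotator.
    """
    confident = [ok for conf, ok in predictions if conf >= threshold]
    coverage = len(confident) / len(predictions)
    accuracy = sum(confident) / len(confident) if confident else None
    return accuracy, coverage


# Hypothetical per-frame classifier outputs on a labeled validation clip.
frames = [(0.99, True), (0.95, True), (0.90, True), (0.80, False),
          (0.70, True), (0.60, False), (0.55, True), (0.40, False)]

for threshold in (0.5, 0.75, 0.9):
    acc, cov = accuracy_coverage(frames, threshold)
    print(f"threshold={threshold}: accuracy={acc:.2f}, machine-labeled={cov:.0%}")
```

Raising the threshold increases accuracy on the machine-labeled frames but sends more frames to the human.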


Real-Time Glance Classification
Emotion Recognition

* Many ways to taxonomize emotion.

* Example: Parrott's primary emotions:
  * Love
  * Joy
  * Surprise
  * Anger
  * Sadness
  * Fear

* Two approaches
  * General
  * Application-specific

[Figure: wheel of emotions. Legible labels include aggressiveness, submission, contempt, awe, boredom, distraction, remorse, disapproval.]
Building Blocks: Facial Expressions

* 42 individual facial muscles in the face.
General Emotion Recognition
Example: Affectiva SDK

Anger, Contempt, Disgust, Fear, Joy, Sadness, Surprise
General Emotion Recognition
Example: Affectiva SDK

Expressions that increase or decrease the likelihood of each emotion:

(emotion label not legible): decreased by Brow Raise, Brow Furrow
Anger: increased by Brow Furrow, Lid Tighten, Eye Widen, Chin Raise, Mouth Open, Lip Suck; decreased by Inner Brow Raise, Brow Raise, Smile
Disgust: increased by Nose Wrinkle, Upper Lip Raise; decreased by Lip Suck, Smile
Application-Specific Emotion Recognition:
Driver Frustration

Class 1: ... with Voice-Based Interaction
Class 2: ... with Voice-Based Interaction

[Video overlay: per-frame face analysis readout listing gender, glasses, interocular distance, mean face luminance, head angles (pitch, roll, yaw), emotions (anger, contempt, disgust, fear, joy, sadness, surprise), expressiveness, and expressions (brow furrow, chin raiser, eyes closed, inner brow raise, lip depressor, lip press, lip pucker, lip raiser, lip suck, mouth open, nose wrinkle, outer brow raise, smile, smirk).]

Emotion Generation
Overview

* Human Imperfections
* Pedestrian Detection
* Body Pose Estimation
* Face Detection
* Glance Classification
* Emotion Recognition
* Cognitive Load Estimation

* Human-Centered Vision for Autonomous Vehicles





Eye in Motion:
Saccades

[Figures: eye anatomy (upper eyelid, pupil, sclera, lower eyelid); eye position (left/right) over time]

* Ballistic movements
* Can be small or large (reading vs exploring the room)
* Can be voluntary or reflexive
* During 200 ms period: compute the position of target with respect to fovea and convert to motor command
* The eye movement is 15-100 ms
* If target moves during eye movement, adjustments have to be made after movement is completed.
Eye in Motion:
Smooth Pursuits

[Plot: Eye movement (degrees) vs. Time (s), showing eye position and target position for target movements of 10, 15, and 20 degrees/s, with a catch-up saccade]

* Slower tracking movements that keep stimulus on the fovea

* Voluntary in that observer can choose whether or not to track moving stimulus

* Only highly trained observers can make a smooth pursuit movement in the absence of a moving target
Motion During Fixation

* Drifts: slow movements away from fixation point, 20 to 40 Hz

* Flicks (microsaccades): reposition the eye on target, 1 degree max

* Ocular micro tremors: 150-2500 nm, 40-100 Hz

[Plot: distribution of microsaccade amplitude (arcmin), human microsaccades (video tracking) compared with search coil recordings; Hafed et al., 2009]

[Video: Source (500 FPS) and motion magnified x75 (30-50 Hz)]
Cognitive Load Overview
From the Perspective of Computer Vision

* Each of the following bullet points has several papers validating it.

* Pupil equations:
  * Brighter light = smaller pupil
  * Higher cognitive load = larger pupil

* Blink equations:
  * Higher cognitive load = slower blink rate
  * Higher cognitive load = shorter blink duration

* Questions:
  * Which of these metrics can be accurately extracted in real-world driving data?
  * Are there other metrics that may work better in such conditions?
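Blink rate and blink duration, the two blink metrics named above, can both be read off a per-frame eye-closed signal. The sketch below assumes such a boolean signal is already available (for example from a blink-state classifier). The function name, frame rate, and sample clip are hypothetical:

```python
def blink_stats(eye_closed, fps):
    """Return (blinks per minute, mean blink duration in seconds).

    eye_closed: per-frame booleans, True while the eyelid is closed.
    A blink is counted as one maximal run of consecutive closed frames.
    """
    runs, current = [], 0
    for closed in eye_closed:
        if closed:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:  # clip ended while the eye was still closed
        runs.append(current)

    minutes = len(eye_closed) / fps / 60
    rate = len(runs) / minutes if minutes > 0 else 0.0
    mean_duration = sum(runs) / len(runs) / fps if runs else 0.0
    return rate, mean_duration


# Hypothetical 2-second clip at 15 fps with two blinks (3 and 2 frames long).
clip = [False] * 5 + [True] * 3 + [False] * 12 + [True] * 2 + [False] * 8
rate, duration = blink_stats(clip, fps=15)
print(f"{rate:.0f} blinks/min, mean duration {duration:.3f} s")
```

In real driving video the hard part is producing a reliable `eye_closed` signal under vibration and lighting changes, which is what the questions on this slide are about.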
3D Convolutional Neural Networks


temporal 
Real-World Data

92 drivers perform "n-back" tasks requiring various levels of cognitive load:

* 0-back: Say the number right after it's read.
* 1-back: Say the number previous to the current one.
* 2-back: Say the number 2 prior to the current one.

[Figure: Auditory N-Back Task timeline showing match and no-match trials]
Cognitive Load Estimation

Low CL (0-back), Medium CL (1-back), High CL (2-back)

Eye Image Sequence
* 6 seconds, 15 fps, 90 images

Pipeline:
Eye Image Sequence -> Extract Pupil Position -> HMM Model -> Cognitive Load Classification Decision
Eye Image Sequence -> 3D-CNN Model -> Cognitive Load Classification Decision

Two approaches: HMM and 3D-CNN

HMM: Hidden Markov Model
* Input: Sequence of pupil positions (normalized by intraocular segment)

3D-CNN: Three Dimensional Convolutional Neural Network
* Input: Sequence of raw images of eye region
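What distinguishes a 3D convolution from a 2D one is that the kernel also slides along the time axis of the image sequence, so a single filter can respond to motion. The sketch below is a naive single-channel illustration in pure Python. It is not the lecture's model, and the toy clip and kernel are invented for the example:

```python
def conv3d_valid(volume, kernel):
    """Single-channel 3D cross-correlation ('valid' padding, stride 1).

    volume: nested lists indexed [t][y][x]; kernel: nested lists [dt][dy][dx].
    Deep learning libraries call this operation "convolution".
    """
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Toy clip: 4 frames of 3x3 pixels whose brightness rises by 1 each frame.
clip = [[[float(t)] * 3 for _ in range(3)] for t in range(4)]
# Temporal-difference kernel spanning 2 frames and a single pixel.
kernel = [[[-1.0]], [[1.0]]]
response = conv3d_valid(clip, kernel)
print(len(response), len(response[0]), len(response[0][0]))
```

Every output value is 1.0 here because each pixel brightens by exactly 1 between consecutive frames; a 2D kernel applied frame by frame could not measure that change.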







Dealing with Vibration and Movement 





(Remove effects of head movement) 


Original Video AAM Landmarks  Frontalized Video 
Preprocessing Pipeline

1. Face Detection
2. Face AAM (43 pts)
3. Face Frontalization
4. Eye Lid AAM (25 pts)
5. Classify Pupil Visibility
6. Pupil AAM (39 pts)
Visualizing the Dataset: Pupil Movement

[Three density plots: 0-Back (Low Cognitive Load), 1-Back (Medium Cognitive Load), 2-Back (High Cognitive Load). Axes: Horizontal Pupil Movement (Normalized) vs. Vertical Pupil Movement (Normalized), each from -0.4 to 0.4.]


Metric: Pupil position normalized by intraocular distance 
Visualization: Kernel density estimation (KDE) 
Dataset size: 92 subjects 


Takeaway: Observable aggregate differences between all 3 levels 
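The metric above expresses pupil position in units of a reference length so that it is comparable across drivers and camera distances. The slides do not give the exact formula, so the sketch below is one plausible reading under stated assumptions: the reference segment is defined by two eye landmarks (`ref_a`, `ref_b`, both hypothetical), and the pupil offset is measured from the segment midpoint.

```python
import math


def normalize_pupil(pupil, ref_a, ref_b):
    """Express a pupil position relative to a reference segment.

    Assumption: ref_a and ref_b are the endpoints of the intraocular
    segment; the slides do not specify which landmarks define it.
    Returns the pupil offset from the segment midpoint divided by the
    segment length.
    """
    length = math.dist(ref_a, ref_b)
    if length == 0:
        raise ValueError("reference segment has zero length")
    mid_x = (ref_a[0] + ref_b[0]) / 2
    mid_y = (ref_a[1] + ref_b[1]) / 2
    return ((pupil[0] - mid_x) / length, (pupil[1] - mid_y) / length)


# Made-up pixel coordinates for illustration.
norm = normalize_pupil(pupil=(52.0, 31.0), ref_a=(40.0, 30.0), ref_b=(60.0, 30.0))
print(norm)
```

Normalized positions of this kind, pooled over many frames, are what a kernel density estimate would be computed on.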
Driver Cognitive Load Estimation

HMM Approach

Classes: A: 0-back, B: 1-back, C: 2-back

[Confusion matrix. Diagonal: A 75.9%, B 73.4%, C 83.7%. Legible off-diagonal entries: 18.1%, 16.8%, 10.2%.]

Average Accuracy: 77.7%
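Assuming the three largest percentages on this slide (75.9%, 73.4%, 83.7%) are the per-class accuracies on the diagonal of the confusion matrix, their unweighted mean reproduces the reported average:

```python
# Per-class accuracies, assumed to be the confusion-matrix diagonal.
per_class_accuracy = {"0-back": 75.9, "1-back": 73.4, "2-back": 83.7}

average_accuracy = sum(per_class_accuracy.values()) / len(per_class_accuracy)
print(f"{average_accuracy:.1f}%")
```

The match suggests the reported figure is a mean over the three classes rather than over individual samples.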
3D-CNN Approach

Classes: A: 0-back, B: 1-back, C: 2-back

Average Accuracy: 86.1%
Cognitive Load Estimation:
Open Source - Open Innovation

Implication: Make driver cognitive load estimation accessible

Webcam Video Stream -> Low Cognitive Load, Medium Cognitive Load, or High Cognitive Load
Real-Time Cognitive Load Estimation

DeepCogLoad
Human-Centered Artificial Intelligence Approach

Human Needed? No / Yes

Solve the perception-control problem where possible. And where not possible, involve the human.
Human at the Center of Automation:
The Way to Full Autonomy Includes the Human

Spectrum from Fully Human Controlled to Fully Machine Controlled: Ford F150, Tesla Model S, Google Self-Driving Car
[Figure labels: Training Dataset, Testing Dataset, Tesla Model S]
Human-Centered Autonomy

A self-driving car may be more a Personal Robot and less a perfect Perception-Control system. Why:

* Flaws need humans: The scene understanding problem requires much more than pixel-level labeling

* Exist with humans: Achieving both an enjoyable and safe driving experience may require "driving like a human".

Quite possibly, the first wide reaching and profound integration of personal robots in society.

* Wide reaching: 1 billion cars on the road.

* Profound: Human gives control of his/her life directly to robot.

* Personal: One-on-one relationship of communication, collaboration, understanding and trust.
Human (and Machine) Imperfections


* "People call these things 
imperfections, but they're not. That's 
the good stuff..." 


* "And then we get to choose who we let 
in to our weird little worlds. You're not 
perfect, sport. And let me save you the 
suspense. This girl you met, she isn't 
perfect either. But the question is: 
whether or not you're perfect for each 
other. That's the whole deal. That's 
what intimacy is all about..." 





* "Now you can know everything in the 
world, sport, but the only way you're 
finding out that one is by giving it a 
shot." 
HCAV: Human-Centered Autonomous Vehicle

CHI 2018 Course:
Deep Learning for Understanding the Human

* Part 1 (80 minutes)
  * Introduction to Deep Learning
    * Theory, insights, and intuitions
    * Tools to get started applying DL to various domains
  * Convolutional Neural Networks
    * Face recognition
    * Eye tracking
    * Cognitive load estimation
    * Emotion recognition

* Part 2 (80 minutes)
  * Recurrent Neural Networks
    * Natural Language Processing
    * Voice Recognition
  * Mixing Convolutional and Recurrent Neural Networks
    * Activity recognition

* Part 3 (80 minutes)
  * Generative Neural Networks
    * Speech Synthesis
    * Peripheral Vision Visualization








6.S099: Artificial General Intelligence (agi.mit.edu)

* dates, times, rooms in red are different than the usual

Mon, Jan 22, 7pm, 54-100: Lex Fridman, MIT. Artificial General Intelligence
Tue, Jan 23, 7pm, 54-100: Josh Tenenbaum, MIT. Computational Cognitive Science
Wed, Jan 24, 1pm, 10-250: Ray Kurzweil, Google. How to Create a Mind
Thu, Jan 25, 7pm, 54-100: Lisa Feldman Barrett, NEU. Emotion Creation
Fri, Jan 26, 7pm, 54-100: Nate Derbinsky, NEU. Cognitive Modeling
Mon, Jan 29, 1:30pm, 26-100: Andrej Karpathy, Tesla. Deep Learning
Mon, Jan 29, 7pm, 54-100: Stephen Wolfram, Wolfram Research. Knowledge-Based Programming
Tue, Jan 30, 7pm, 54-100: Richard Moyes, Article36. AI Safety: Autonomous Weapon Systems
Wed, Jan 31, 7pm, 54-100: Marc Raibert, Boston Dynamics. Robots That Work in the Real World
Thu, Feb 1, 7pm, 54-100: Ilya Sutskever, OpenAI. Deep Reinforcement Learning
Fri, Feb 2, 7pm, 54-100: Lex Fridman, MIT. Human-Centered Artificial Intelligence





DeepTraffic

Americans spend 8 billion hours stuck in traffic every year.
Deep neural networks can help!

* Competitions
  * Ongoing until May 2018. Results, insights -> NIPS 2018
  * DeepTraffic: https://selfdrivingcars.mit.edu/deeptraffic
  * SegFuse: https://selfdrivingcars.mit.edu/segfuse
  * DeepCrash: https://selfdrivingcars.mit.edu/deepcrash

* Upcoming MIT Courses:
  * 6.S099: Artificial General Intelligence
    https://agi.mit.edu
  * 6.S191: Introduction to Deep Learning
    http://introtodeeplearning.com
  * 15.S14: Global Business of AI & Robotics
    http://tiny.cc/gbair18

* If you're interested in the application of deep learning in the automotive space, come do research with us: https://hcai.mit.edu/join (opens in Feb 2018)
Thank You

Lex Fridman
Jack Terwilliger
Julia Kindelsberger